Abstract

It is common knowledge that desktop computing power
is now increasing mainly through the move to
multi-core chips. This is a challenge for the software
community in general, but is a particular problem for
audio processing, where the need is increasingly for
real-time operation and low latency. We propose a
number of possible paths that need investigation,
including multi-core and special accelerators, which
may offer useful new musical tools. We define this as
High-Performance Audio Computing, or HiPAC, in analogy
to current HPC activity, and indicate some on-going
work.
1 Introduction

We have been particularly struck by the paper by
James Moorer[1], where he states

This is compute power far beyond what 
even the most starry-eyed fortune-teller 
could have imagined! It will change the 
very nature of what audio is, and what 
audio engineers do, since it changes what 
is possible at a fundamental level. 

In pursuit of this aim we propose the topic of
HiPAC (High-Performance Audio Computing)
as the study of new advanced processor architectures
to enhance audio synthesis, processing and
music composition. The name is a deliberate echo
of the field of general High-Performance Computing
(HPC), but there are significant differences.
Rather than considering the use of supercomputers
and megaclusters we are concentrating on what
will within a relatively short timescale be
consumer-grade hardware. The emphasis has to be on
affordable low-latency real-time processing.

The long-established Moore’s law, the doubling
of the number of transistors in a chip every 18
months, has persisted as an indirect measure of
processing power, so long as the emphasis has
remained on increasing the processing power of
a single CPU core. However, obtaining further
speed from a single core (demanding, not least,
increased clock speeds) incurs a significant increase in
power requirements, expressed not only in terms of
wattage but also in terms of heat generation, which
in turn demands further power for its dissipation
using air or other cooling. This high power consumption
is not only directly costly, it also implies an
increasingly unacceptable “carbon cost”, and for the
musician, noise. The universally recognised solution
is to increase the number of processing cores,
so that more computations can be made concurrently.
This move towards parallel computation has
profound implications for both users and programmers.
In this respect, music and audio processing
raises a number of interesting challenges and
opportunities. On the one hand, a digital audio stream
is by definition serial, naturally suggesting a serial
(if fast) mode of computation; on the other,
practical audio processes may involve a large number of
identical and often independent tasks which can in
principle run concurrently.

For many musicians, the availability of higher
computing power has typically been measured in
terms of the number of demanding processes (such
as reverberation, or synthesiser voices) that can be
performed in real time. While this remains an
important measure, we are interested more in the
possibility of exploring in real time processes that have
hitherto been disregarded as too computationally
demanding but which the next generation of hardware
could bring into the realm of real-time execution.

2 The Hardware — from Supercomputers 
to the Desktop 

For a period the interest was in “supercomputing”,
as typified by vector and array processors, like the
original Cray 1, built in 1976 for the Los Alamos
National Laboratory[2] and specified to reach the
then astonishing speed of 160 Mflops and an almost
equally astonishing main memory of 8 Mbytes. The
machine solved a central issue for any cluster, that of
inter-node communication, by ensuring that no
connection was longer than 4 feet. The resulting
horseshoe shape became for the general public almost the
defining iconic image of a “supercomputer”.

At a similar time in the UK there were many
alternative designs, such as dataflow [3] and the ICL
DAP [4].

There was a brief flowering of research into
parallel computing, and much discussion on meshes
and hyper-cubes. A key aspect was the use of a
Single-Instruction-Multiple-Data (SIMD) computational
mode, which is unsuited to computations
distributed over physically disparate nodes. This leads
to a number of synchronous processing elements,
with a common clock, which must be physically
very close (on the same board or chip). The concept of
SIMD still dominates much computing, even in the
SSE instructions in recent chips.

Early MIMD (Multiple-Instruction-Multiple- 
Data) processing was beset by communication costs 
and organisational issues. However it is this style 
that led the revival of interest, with the Beowulf 
cluster approach, using a network of commodity 
PCs. 

In contrast, modern SIMD fine-grained parallel
architectures are associated today firstly with the
vector extensions to standard CPUs (e.g. Altivec
on the PPC, SSE on x86) whereby several arithmetic
units are employed in parallel, and secondly
with the graphics accelerator cards now essential to
all consumer workstations (especially where high
performance in games is required). The difference
clearly is that these SIMD systems are typically
monolithic, implemented in one chip. An example
of this approach is the IBM Cell Broadband
Processor[5] employed in the Sony Playstation. This chip
employs a conventional PowerPC master CPU
coupled with 8 parallel floating-point PEs.

In the attempt to maintain a generational
performance increase, chip manufacturers are already
supplying devices and motherboards supporting
multiple CPU cores (currently with a maximum of four
cores per chip, in the case of the processors from
Intel and AMD), while investigating designs which
significantly increase the number of cores. Intel,
for example, have publicised a development chip
featuring 80 cores, and claim performance up to 1
Teraflop, while indicating that a commercial release
may still be 10 years away. When associated with
cluster computing we obtain the current batch of
supercomputers; indeed it is now a mark of pride
to have a cluster with more cores than one’s
neighbours.


At the same time a highly significant market
has arisen for SIMD-style accelerator systems
designed to work in conjunction with a host computer,
and targeted at the HPC community. One
significant example is the Tesla accelerator series from
nVidia [6]. The company have for many years
provided an SDK for several of their GPU-based video
cards, featuring their CUDA development
environment. The Tesla series represents their first
general-purpose accelerator product not specifically
developed as a graphics accelerator, while based on the
same technology. The Graphics Processing Unit
(GPU) has already attracted the interest of a
number of audio researchers (e.g. [7]), especially for the
computation of 3D acoustic modelling tasks. A
second example is the “Advance” floating-point
accelerator card from the British startup Clearspeed[8].
Each card features two of their CSX600 chips, each
offering 96 PEs supporting double-precision
floating-point computation, with very low power
consumption of some 30W per card. Each card
provides a sustained computation speed of 50 Gflops.
Both the nVidia and Clearspeed products provide
automatic acceleration support for Matlab, enabling
them to be rapidly integrated into an existing HPC
cluster.

Clearspeed’s most recent device, the CSX700,
merges the two CSX600 cores onto one chip, and
the new PCIe-based cards employing them are
considerably cheaper, making them directly competitive
with the nVidia Tesla product, and thus
significantly more accessible to the individual user.

In a parallel development, FPGA devices have
increased substantially in both size and speed,
offering a powerful alternative path to custom DSP
chip design for both academic research[9] and
industry. For example, the new “Crystal Core”
system from Fairlight is based on a dynamically
reconfigurable FPGA device [10]. It is also of interest to
note the use of special “physics” co-processors in
some game computers.

3 The software 

Over time there have been many attempts at
software aimed at parallel execution schemes, often tied
to particular hardware.

Of particular note is the now defunct Inmos
Transputer, almost unique in being closely
associated with a dedicated concurrent programming
language, Occam[11]. (An insider’s history of the
Transputer can be found in [12; 13].)

This in turn was based on the formal language
Communicating Sequential Processes (CSP)
devised by Hoare[14], which is still highly
influential in the field of concurrent and parallel
computation. Amongst computer musicians, the
Transputer is of special note for its use in the first
real-time parallel implementation of Csound[15],
involving some 170 Transputers, though it is
recounted that the problem of heat dissipation was
never resolved. The Transputer however lives on
in emulation, for example in the forms of an
FPGA-based device [16] and of a software emulation of the
instruction set supporting Occam[17].

There is a tension between using a language
that directly supports a parallel and concurrent
paradigm, and retaining the ubiquity of the C and
C++ languages, neither of which embodies any
explicit support for parallelism. In the latter case
the trend is to a dependence on general threading
libraries (such as POSIX pthreads for C, or
the Boost C++ threading library), or on language
enhancements such as OpenMP, that may or may
not make explicit use of platform-specific
facilities (with the many well-known issues associated
with multi-threaded programming using these
languages).

The prevailing trend is the emergence of a
variety of custom extensions to C, usually serving
particular hardware. For example, in a fashion
similar to the MasPar, the Clearspeed hardware is
supported by an extended C compiler supporting a
custom keyword poly defining a variable stored (with
unique values) on each of the 96 PEs and referenced
as a single entity. Further extensions and library
functions deal with the sometimes complex tasks
of transferring data between PEs, and between poly
and mono (conventional) memory. With respect to
FPGA programming, the company Celoxica
provides the language Handel-C (following CSP
principles) enabling a range of FPGA development
systems to be programmed using a similarly extended
C language supporting both coarse and fine grained
concurrency by means of wait and par keywords
[18].

This very condensed and selective snapshot of the
fields of hardware and software support for
concurrent processing leads us to a consideration of the
particular challenges presented by audio. In this
regard, we must recall the inherently serial
(one-dimensional) nature of raw audio data, which would
seem to impose significant constraints on the full
exploitation of parallel processing. Many of the
fundamental processes in which we are interested, such as
recursive filters, are inherently data-dependent and
(taking the example of a plain first-order IIR filter)
therefore by definition un-parallelisable. The
relationship between sequential and parallel
computation is summarised in Amdahl’s law [19], which
may be stated as the speedup being

1/(S + P/N)

where S is the fraction of serial computation, P =
1 - S is the fraction of parallelisable computation
and N is the number of processors.
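
As a numerical illustration of Amdahl's law as stated above (writing S for the serial fraction), the following minimal sketch tabulates the speedup for a few processor counts. The function name and the serial fraction of 0.1 are assumptions chosen purely for illustration; they are not figures from this paper.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup 1 / (S + P/N), with P = 1 - S."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    S = 0.1  # assumed serial fraction, for illustration only
    for n in (1, 2, 4, 8, 96, 1_000_000):
        print(f"N = {n:>7}: speedup = {amdahl_speedup(S, n):.2f}")
    # As N grows, the speedup approaches but never exceeds 1/S.
    print(f"limit 1/S = {1.0 / S:.2f}")
```

Even with a serial fraction of only one tenth, the speedup cannot exceed ten however many processors are added.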

The limiting value is therefore 1/S for an
infinite number of processors; that is, computation will
be dominated by the sequential subset. This law
(which has been applied in such areas as business
and project management as well as in computing)
suggests some serious limits on how much speedup
can be obtained from parallel processing. However,
this has more recently been demonstrated to be an
overly conservative estimate [20], with respect to
a hypercube-based system. For audio processes an
equivalent to Amdahl’s law is as yet undefined, and
would seem to be highly dependent on the
characteristics of the architecture, as well as on the
nature of the process itself. Clearly, a process
composed mostly of recursive processes will gain
relatively little speedup (dependent entirely on the
mapping of the number of recursive streams to the
number of processors). On the other hand, many audio
processes are inherently highly parallelisable with
few or no data dependencies, most obviously
frame-based analysis techniques such as the DFT. Such
algorithms can be expected to derive the maximum
benefit from parallel computation using SIMD
models. As already indicated in the examples of graphic
modelling, audio processes based on physical
models (e.g. finite element networks) similarly offer a
very good fit to a parallel computational model. One
factor that must be borne in mind is that the clock
speeds of such devices (rated in MHz rather than
GHz), as also of most DSP chips, are often
significantly lower than those currently employed in
desktop PCs. The overall computation speed of such
processors depends almost entirely on parallel
computation, such that serial computation needs to be
kept to an absolute minimum.
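
The contrast between a recursive filter and a set of independent tasks can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the filter is a generic one-pole form y[n] = x[n] + a*y[n-1], and the thread pool merely shows the structure of the decomposition (in standard Python, threads would not by themselves give a real speedup for this arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def one_pole(x, a):
    """First-order recursive filter y[n] = x[n] + a * y[n-1].

    Each output needs the previous output, so the loop over
    samples cannot be divided between processors.
    """
    y = []
    prev = 0.0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def filter_bank(streams, a):
    """Filter several independent streams.

    The streams share no state, so each filter may be given to
    its own worker; the parallelism is across streams, not
    within one stream.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: one_pole(s, a), streams))


if __name__ == "__main__":
    print(one_pole([1.0, 0.0, 0.0, 0.0], 0.5))
    print(filter_bank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], 0.5))
```

The available speedup here depends entirely on how many independent streams there are to distribute, which is the mapping issue noted above.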

As the current technology as exemplified in the
systems described above is seeking to reach and
exceed 1 Teraflop speeds, we argue that it is now time
to revisit audio processes hitherto disregarded on
account of their computational demands. Even if
they are time-consuming today, they will not be in
only a few years, so that we must start investigating
them now. By the same argument, we expect
hardware that is currently relatively expensive to fall to
commodity prices and therefore to become
accessible to everyone. In particular, we advocate in the
HiPAC program the study of no-compromise
algorithms: rather than make simplifications to an
algorithm purely for reasons of slowness, HiPAC
considers such algorithms in as pure or “ideal” a form
as possible, especially where that ideal form may
lead to musically useful and novel behaviour.

We present below two contrasting case studies 
which we have started. 

We can summarise the primary defining charac¬ 
teristics of a pure HiPAC DSP process: 

• use of highly parallel fine-grained architectures
(e.g. following the SIMD model), though we
do not exclude more “conventional” multi-core
computation

• real-time performance or better 

• implies low latency 

• ideal and “no-compromise” forms of algo¬ 
rithms 

• new processes, and hence new effects and 
sounds, not simply “more of the same” - 
whether more reverbs or more voices. 

Finally we note that there are alternative forms of
parallel and distributed computation that can be
applied to audio, such as grids, web services[21] and
clusters. We do not consider these further in this
paper, but suspect they will eventually find a place in
the galaxy.

4 A HiPAC case study - the Sliding Phase 
Vocoder 

In the paper [22] we identified the Sliding Phase
Vocoder (SPV) as a canonical example of a HiPAC
process. A fuller description of the SPV and its
precursor SDFT can be found in that paper and
elsewhere [23; 24; 25].

We focus here on the HiPAC aspects that relate
to the SPV. The conventional phase vocoder
transforms audio into the frequency domain by means
of a series of overlapping Fourier transforms (using
the FFT algorithm). All practical implementations
(and especially where real-time performance is
required) overlap analysis frames by some small
fraction of the window size. For a given sample rate R,
the analysis rate A is given by the overlap length D
in samples (the hop between successive frames), as
A = R/D. This can be described
as the “hopping phase vocoder”. It is well known
that increasing the overlap leads to improved sonic
performance, but at a directly increasing
computational cost. In the limit, as D -> 1, the “ideal”
phase vocoder overlaps frames by one sample, so
that A = R. While recognised in the literature
as offering sonic advantages, this sliding form has
hitherto been avoided in practice as being
computationally prohibitive.
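
To give a sense of scale for the relation A = R/D, the sketch below prints the number of analysis frames per second for a few hop lengths. The sample rate of 44100 Hz and the particular values of D are assumptions for illustration only, and the function name is hypothetical.

```python
def analysis_rate(sample_rate, hop):
    """Analysis frames per second, A = R / D."""
    return sample_rate / hop


if __name__ == "__main__":
    R = 44100  # assumed sample rate in Hz
    for D in (256, 128, 64, 1):
        print(f"D = {D:>3}: A = {analysis_rate(R, D):.1f} frames/s")
```

Moving from a hop of 256 samples to the sliding case D = 1 multiplies the number of frames to be computed each second by 256.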


Our implementation of the SPV is based on the
use of the Sliding DFT, in which the DFT frame is
updated every sample by a simple complex rotation
of the bin values

F_{t+1}(n) = (F_t(n) - f_t + f_{t+N}) e^{2\pi i n/N}

It should be noted that this process is inherently
parallel between the bins. It has also been found to
reduce the latency by as much as 75% compared to
conventional pvoc. The computational demands are
greatly increased, it being in effect an N^2 process.
However, most transformations applied to analysis
frames are also parallelisable (often involving
standard vector arithmetic operations).
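
The single-sample update can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the SPV: it assumes the usual DFT convention F_t(n) = sum over j of f_{t+j} e^{-2 pi i j n / N}, uses no window, and all function names are hypothetical. Note that the update of each bin uses only that bin's previous value, which is the independence between bins referred to above.

```python
import cmath


def dft(frame):
    """Direct DFT of one frame, used only as a reference."""
    size = len(frame)
    return [
        sum(frame[j] * cmath.exp(-2j * cmath.pi * j * n / size)
            for j in range(size))
        for n in range(size)
    ]


def sliding_dft_step(bins, oldest, newest):
    """One-sample update:
    F_{t+1}(n) = (F_t(n) - f_t + f_{t+N}) e^{2 pi i n / N}.

    Every bin is updated independently of the others.
    """
    size = len(bins)
    return [
        (bins[n] - oldest + newest) * cmath.exp(2j * cmath.pi * n / size)
        for n in range(size)
    ]


def sliding_dft(signal, size):
    """Yield the DFT of every length-`size` window of `signal`."""
    bins = dft(signal[:size])
    yield bins
    for t in range(len(signal) - size):
        bins = sliding_dft_step(bins, signal[t], signal[t + size])
        yield bins


if __name__ == "__main__":
    signal = [((7 * k) % 11) / 11.0 - 0.5 for k in range(24)]
    for t, frame in enumerate(sliding_dft(signal, 8)):
        reference = dft(signal[t:t + 8])
        error = max(abs(a - b) for a, b in zip(frame, reference))
        print(f"t = {t:2d}: max deviation from direct DFT = {error:.2e}")
```

Each step costs N complex multiplications, so producing a new frame for every sample over a window of N samples gives the N^2 behaviour mentioned in the text.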

We have also shown that pitch shifting is a
much simpler process compared to the hopping
pvoc, since the frequency range of each bin covers
the whole audio range (resynthesis is by oscillator
bank). This enables, for example, audio-rate
Frequency Modulation to be applied cleanly to an
arbitrary input, a process we have termed
Transformational FM[22]. The single-sample update
(permitting high modulation rates) is essential to the
implementation of audio-rate FM, in both time and
frequency domain forms; thus TFM is an example of
a frequency-domain process that cannot be
implemented using the hopping pvoc.

It should be noted that these algorithms are
available in Csound5, but the dismal performance means
that they are unlikely to be used in any but off-line
processing.

In terms of the HiPAC criteria listed above, we 
can see that they are all met by the SPV: 

• highly parallelisable

• streamable in real time given fast enough
hardware

• lower latency compared to conventional pvoc 

• an “ideal” or no-compromise version of pvoc 

• enables a new class of transformation, TFM, 
not realisable using standard pvoc. 

We are investigating the use of the Clearspeed
accelerator chips in implementing a viable version of
this algorithm. It has been shown that this hardware
can support significantly in excess of a hundred
oscillators, so it seems a good match.

5 A HiPAC case study - Parallelising 
Csound 

A slightly different case is exemplified by Csound.
There is a multitude of music pieces that depend on
the syntax and semantics of the Csound system, and
it is a cardinal feature of the system that existing
pieces must not be broken by developments [26]. If
this synthesis language is to work in a world where
everyone has a multicore computer we need to find
a way to harness multiple processes. It is simple
to use one processor for the GUI and one for the
processing, but this is hardly sufficient. We need to
split the audio processing itself between processors
or threads.

There have been at least three previous attempts
at a parallel Csound. We have already cited the
Transputer Array project. A similar project was
Midas [27], where the processing was developed as
a flowgraph of unit generators, which were then
mapped to a network of SGI workstations. This
process has an inherent latency, and the communication
costs tend to dominate. It would be good to
experiment with this scheme, but it seems unavailable
now.

Another less documented system is that of
Vercoe, where Extended Csound was implemented on
a small number of SHARC processors, with a fixed
allocation of instruments to processors, and a
rendezvous point at the end of each kontrol cycle. This
scheme is also found in the later work described at
Sounds Electric[28]. The model restricts the
language, and introduces a change in semantics
relating to the order of instruments, but it is a working
system.
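
The general shape of such a scheme (a fixed allocation of instruments to workers, with a rendezvous at the end of every control cycle) can be sketched with a thread barrier. This is only an illustration of the idea and makes no claim about how Extended Csound was actually written; the function names, the instrument calling convention instr(cycle, ksmps) and the mixing step are all assumptions.

```python
import threading


def run_cycles(allocation, n_cycles, ksmps):
    """Run a fixed allocation of instruments on parallel workers.

    allocation: one list of instrument functions per worker; each
    instrument is called as instr(cycle, ksmps) and returns a list
    of ksmps samples. Returns the mixed output block per cycle.
    """
    n_workers = len(allocation)
    partial = [None] * n_workers
    output = []

    def mix():
        # Called by exactly one thread once all workers have arrived,
        # before any of them is released into the next cycle.
        block = [0.0] * ksmps
        for part in partial:
            for i, sample in enumerate(part):
                block[i] += sample
        output.append(block)

    barrier = threading.Barrier(n_workers, action=mix)

    def worker(index):
        for cycle in range(n_cycles):
            block = [0.0] * ksmps
            for instr in allocation[index]:
                for i, sample in enumerate(instr(cycle, ksmps)):
                    block[i] += sample
            partial[index] = block
            barrier.wait()  # rendezvous at the end of the control cycle

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output


def constant(value):
    """Hypothetical test instrument: a block of value + cycle."""
    return lambda cycle, ksmps: [value + cycle] * ksmps


if __name__ == "__main__":
    blocks = run_cycles([[constant(1.0), constant(2.0)], [constant(10.0)]],
                        n_cycles=3, ksmps=4)
    for block in blocks:
        print(block)
```

Because every worker waits for the slowest one in each cycle, the allocation of instruments to workers determines how much of the hardware is actually kept busy.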

A full solution needs to take account of
granularity and semantics. The parcels of work need
to be of sufficient size to overcome the
communication costs, and the semantics defining the order
of instruments and global variables needs analysis.
We previously were involved in a parallel
simulation project in LISP[29] where we followed the
process [30] of analysing the program for points of
synchronisation, and also evaluated the possible cost of
individual functions[31]. We are currently
investigating a compiler-technology approach to the
parallelisation of Csound, using the new parser as the
basis of the flowgraph, and hence of the
identification of synchronisation, and using valgrind and
some scripts to count the cost of individual ugens.
This work is reported in more detail in [32].
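
One way to picture the identification of synchronisation points is to group instruments into levels according to the global variables they read and write, so that instrument order is preserved wherever it matters. The sketch below is a hypothetical illustration of that idea, not the analysis reported in [32]; the data layout, the function name and the variable names are all assumptions.

```python
def schedule_levels(instruments):
    """Group instruments into levels that may run concurrently.

    instruments: list of (name, reads, writes) in instrument order,
    where reads and writes are sets of global variable names.
    An instrument is placed one level after the latest earlier
    instrument with which it conflicts (write/read, read/write or
    write/write on the same global).
    """
    level_of = []
    for i, (_, reads, writes) in enumerate(instruments):
        level = 0
        for j in range(i):
            _, prev_reads, prev_writes = instruments[j]
            conflict = (writes & (prev_reads | prev_writes)) or (reads & prev_writes)
            if conflict:
                level = max(level, level_of[j] + 1)
        level_of.append(level)

    levels = [[] for _ in range(max(level_of, default=-1) + 1)]
    for (name, _, _), level in zip(instruments, level_of):
        levels[level].append(name)
    return levels


if __name__ == "__main__":
    orchestra = [
        ("instr1", set(), {"ga"}),
        ("instr2", set(), {"gb"}),
        ("instr3", {"ga", "gb"}, {"gout"}),
        ("instr4", set(), set()),
    ]
    for depth, names in enumerate(schedule_levels(orchestra)):
        print(f"level {depth}: {names}")
```

A practical scheduler would also weigh each instrument by its measured cost, so that the parcels of work within a level are large enough to repay the cost of synchronisation.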

6 Conclusions 

We define HiPAC as a new class of computationally
demanding audio processes implemented by means
of the next generation of parallel processing
platforms and tools. We have described the primary
trends with respect both to desktop computers and
to hardware accelerators. The latter are already
offering close to Teraflop-class computing power.
We expect the cost of these devices to drop
significantly within the next decade, so that parallel
computing will move from the domain of HPC to
the home desktop. We have presented the Sliding
Phase Vocoder as a canonical example of a HiPAC
process. Many other audio processes are known
that are well suited to parallel implementation. We
also outlined a contrasting parallel audio problem,
utilising multicore processors with existing
applications. We would also cite in particular processes
based on physical models (whether of instruments
or of acoustic spaces), which can be realised not
only with optimised waveguides, but also in a more
“ideal” form using finite element models. We
suggest that by implementing these computationally
intensive processes now, we prepare the ground for
the immediate exploitation of the next generations
of parallel computing hardware, which may be with
us much sooner than we might have supposed only
a year or two ago.

This paper is a revision of the paper[33] presented 
at ICMC2009, revised in the light of comments and 
a panel discussion. 